10:00
WiFi: Catalyst | 👩🏼💼 oxinn-event | 🔑 event-6YJh!
Hollie Johnson
National Innovation Centre for Data
8 March 2023
Our goal is to be a hub for data innovation and data skills. We are not academics, nor are we a traditional consultancy.
We wear many hats, but our main activity is running collaborative data skills projects with external organisations.
Organisations finish with valuable project output, but more importantly their people have gained new data skills to take forward to future projects.
…and many other things!
A successful data science project is one that delivers value to an organisation.
Value can look like:
Research shows most project failures are due to poor project management and scoping
Asking the right questions is as important as having the required technical capabilities
Defining the goal, objectives, deliverables and resources forms part of our popular Data Science Kick-off workshop.
As an organisation:
As an individual:
NLP (Natural Language Processing) is a field covering many language-based tasks, including
and is used for chatbots/virtual assistants, spam filtering, web search and text analysis.
‘Traditional’ NLP requires text preprocessing, since most machine learning models require numeric data as input.
The main stages of text preprocessing are
Data cleaning will vary depending on the context but may include tasks such as
Tokenisation is how we divide text into individual tokens. Commonly, a token corresponds to a word; however, a token could also be
Stop words can be
Stemming is the replacement of words with a stem word. For instance
This can help to reduce the number of unique tokens. But be careful - meaning can be lost!
Lemmatisation is an alternative method that uses a morphological analysis of words to preserve meaning.
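The preprocessing stages above can be sketched in plain Python. This is a deliberately minimal illustration: the stop-word list is tiny and the suffix-stripping "stemmer" is a crude stand-in for a real algorithm such as Porter's (in practice you would use a library like NLTK or spaCy).

```python
import re

# Tiny illustrative stop-word list; real libraries ship much fuller lists.
STOP_WORDS = {"it", "was", "the", "of", "a", "an", "and", "to", "in"}

def tokenise(text):
    """Lowercase and split on non-letter characters (word-level tokens)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude suffix-stripping stemmer (a stand-in for e.g. Porter's algorithm)."""
    for suffix in ("ness", "ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in remove_stop_words(tokenise(text))]

print(preprocess("It was the age of foolishness"))  # ['age', 'foolish']
```

Note how "foolishness" is reduced to "foolish" but "times" would become "tim" - exactly the kind of meaning loss the warning above refers to, and why lemmatisation is often preferred.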
Each unique word in the text is assigned a position in a fixed-length vector (a "bag of words")
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

Common words can dominate, potentially without much information, so frequency can be penalised using Term Frequency - Inverse Document Frequency (TF-IDF), allowing distinct words in a given document to have more weight.
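The vectors above can be reproduced directly. One assumption in the sketch below: the zero at index 3 in every vector suggests the vocabulary also contains a word from a sentence not shown here (presumably "best", from the famous opening line), so that ordering is assumed.

```python
import math

docs = [
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

# Assumed vocabulary ordering that reproduces the vectors on the slide.
vocab = ["it", "was", "the", "best", "of",
         "worst", "times", "age", "wisdom", "foolishness"]

def bag_of_words(doc, vocab):
    """Binary presence vector: 1 if the vocabulary word appears in the document."""
    tokens = doc.split()
    return [1 if word in tokens else 0 for word in vocab]

for d in docs:
    print(d, "->", bag_of_words(d, vocab))

def idf(word, docs):
    """Inverse document frequency: words in fewer documents get a higher weight."""
    df = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / df)

print(idf("the", docs))               # 0.0 -> appears everywhere, little information
print(round(idf("wisdom", docs), 3))  # 1.099 -> distinctive word, higher weight
```

Multiplying each term count by its IDF gives the TF-IDF weighting: "the" is zeroed out entirely, while "wisdom" and "foolishness" dominate the documents that contain them.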
An unsupervised learning approach that learns a continuous multidimensional vector representation for each word, by learning to predict a ‘centre word’ given a fixed-size window around it
developers.google.com/machine-learning/crash-course/embeddings/translating-to-a-lower-dimensional-space
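Predicting a centre word from a fixed-size window is the continuous-bag-of-words (CBOW) setup. A sketch of how the (context, centre) training pairs are generated; the embedding model itself would then be trained on these pairs, which is not shown here.

```python
def cbow_pairs(tokens, window=2):
    """Generate (context, centre) training pairs for a CBOW-style model:
    the model learns to predict the centre word from its surrounding window."""
    pairs = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        if context:
            pairs.append((context, centre))
    return pairs

tokens = "it was the age of wisdom".split()
for context, centre in cbow_pairs(tokens, window=2):
    print(context, "->", centre)
# e.g. ['was', 'the'] -> it
#      ['it', 'the', 'age'] -> was
```

Because words that appear in similar contexts produce similar training pairs, their learned vectors end up close together - the property that makes embeddings useful downstream.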
Modern NLP methods work very differently
Bidirectional Encoder Representations from Transformers
Can be used for
Not suitable for
BERT’s novelty lies in the way it was pre-trained, using two tasks: masked language modelling (predicting randomly masked tokens from their context) and next sentence prediction.
These two tasks are trained concurrently.
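The masked-language-modelling task can be sketched in plain Python. The rates below (select ~15% of tokens; of those, replace 80% with [MASK], 10% with a random word, and leave 10% unchanged) follow the original BERT paper; this is an illustration of the data preparation only, not BERT's actual implementation.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking: select ~15% of tokens; of those, 80% become
    [MASK], 10% become a random vocabulary word, 10% stay unchanged.
    The model is trained to recover the original token at each selected position."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok  # the label the model must predict
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token (but it is still a prediction target)
    return masked, targets

tokens = "it was the best of times it was the worst of times".split()
masked, targets = mask_tokens(tokens, vocab=sorted(set(tokens)))
print(masked)
```

Keeping some selected tokens unchanged (or corrupted) forces the model to build a contextual representation of every position, not just the visibly masked ones.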
jalammar.github.io/illustrated-bert
We do not need to implement (or train) any of these ourselves.
huggingface.co/models
Many pre-trained models are available, some of which are also fine-tuned for specific tasks, so we will take advantage of this and use a model from HuggingFace 🤗
There are post-its on the tables, please add your thoughts to the large sheets of paper on the walls
In Google Colab, select File, then Open notebook
Paste in the URL of the GitHub repository and select the notebook sentiment-analysis.ipynb
There are pre-saved data sets available in the same GitHub repository. To access the URL for the raw data, navigate to the data set you wish to use and click Raw.
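Clicking Raw simply maps the browsing URL onto GitHub's raw-content host, which serves the file itself. A sketch of that mapping, using a hypothetical repository path for illustration (the workshop repository's actual URL is not given in these notes):

```python
def to_raw_url(github_url):
    """Convert a github.com 'blob' URL into its raw.githubusercontent.com
    equivalent, which serves the file contents directly (e.g. for pandas.read_csv)."""
    return (github_url
            .replace("github.com", "raw.githubusercontent.com")
            .replace("/blob/", "/"))

# Hypothetical path for illustration; substitute the workshop repository's URL.
url = to_raw_url("https://github.com/example/nlp-workshop/blob/main/data/tweets.csv")
print(url)
# https://raw.githubusercontent.com/example/nlp-workshop/main/data/tweets.csv
```

Passing the raw URL (not the github.com page) to a data loader avoids accidentally reading the HTML of the GitHub file viewer instead of the data.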
Warning
The data set comprises actual tweets obtained using the free API. These have NOT been filtered for toxicity, profanity etc.
For our model, we will be using Twitter-roBERTa-base for Sentiment Analysis
We’re always eager to hear about real-life use cases of the content we share. If you end up using any of these materials in your organisation, please let us know how.
Feel free to contact me at hollie.johnson@ncl.ac.uk